Seleção da Database¶

Database:¶

Brazilian E-Commerce Public Database by Olist https://www.kaggle.com/olistbr/brazilian-ecommerce?select=olist_order_items_dataset.csv

Descrição:¶

A database foi provida pela Olist, uma empresa de marketplace situada no Brasil. A Olist conecta pequenos negócios por todo o Brasil de uma forma simples. Vendedores conseguem vender seus produtos diretamente pela Olist Store, as entregas são feitas por meio de parceiros.

Problema de Negócio:¶

Quais são os principais fatores que alavancam o número de vendas? Quais os fatores que impactam na avaliação do cliente?

Observação:¶

Esse é o segundo notebook, onde iremos realizar clusterização e analisar os mesmos afim de promover insights. No primeiro notebook com o prefixo '01', realizamos toda a visualização do conjunto, onde foram retirados alguns insights que serão utilizados ao longo das proximas etapas.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import re
import plotly.express as px
import plotly.graph_objects as go
import dash
import scipy.stats as stats

from warnings import simplefilter
from functools import reduce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from math import ceil
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score
from matplotlib.axes._axes import _log as matplotlib_axes_logger
from IPython.display import HTML, display
In [2]:
# Versões dos pacotes usados neste jupyter notebook
%reload_ext watermark
%watermark -a "Herikc Brecher" --iversions
Author: Herikc Brecher

pandas    : 1.4.2
seaborn   : 0.11.2
re        : 2.2.1
matplotlib: 3.5.1
plotly    : 5.6.0
dash      : 2.6.2
numpy     : 1.20.0
scipy     : 1.5.4

In [3]:
simplefilter(action='ignore', category=Warning)
matplotlib_axes_logger.setLevel('ERROR')
%matplotlib inline
sns.set_theme()
In [4]:
# Seed de Aleatoriedade
seed_ = 194
np.random.seed(seed_)
In [5]:
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Carregamento dos Dados¶

image.png

In [6]:
# Carregamento de todos datasets
dtCustomers = pd.read_csv('../data/olist_customers_dataset.csv', encoding = 'utf8', dtype={'customer_zip_code_prefix': str})
dtGeolocation = pd.read_csv('../data/olist_geolocation_dataset.csv', encoding = 'utf8', dtype={'geolocation_zip_code_prefix': str})
dtOrderItems = pd.read_csv('../data/olist_order_items_dataset.csv', encoding = 'utf8')
dtOrderPayments = pd.read_csv('../data/olist_order_payments_dataset.csv', encoding = 'utf8')
dtOrderReviews = pd.read_csv('../data/olist_order_reviews_dataset.csv', encoding = 'utf8')
dtOrders = pd.read_csv('../data/olist_orders_dataset.csv', encoding = 'utf8')
dtProducts = pd.read_csv('../data/olist_products_dataset.csv', encoding = 'utf8')
dtSellers = pd.read_csv('../data/olist_sellers_dataset.csv', encoding = 'utf8')

2. Pre-Processamento¶

2.1 Feature Engineering¶

Iremos adicionar uma variavel extra para todos os datasets que contém o Estado. A variavel adicionada é a Região referente ao Estado.

In [7]:
regioes = {
    'AC': 'Norte',
    'AL': 'Nordeste',
    'AP': 'Norte',
    'AM': 'Norte',
    'BA': 'Nordeste',
    'CE': 'Nordeste',
    'DF': 'CentroOeste',
    'ES': 'Sudeste',
    'GO': 'CentroOeste',
    'MA': 'Nordeste',
    'MT': 'CentroOeste',
    'MS': 'CentroOeste',
    'MG': 'Sudeste',
    'PA': 'Norte',
    'PB': 'Nordeste',
    'PR': 'Sul',
    'PE': 'Nordeste',
    'PI': 'Nordeste',
    'RJ': 'Sudeste',
    'RN': 'Nordeste',
    'RS': 'Sul',
    'RO': 'Norte',
    'RR': 'Norte',
    'SC': 'Sul',
    'SP': 'Sudeste',
    'SE': 'Nordeste',
    'TO': 'Norte'
}
In [8]:
dtCustomers['regiao'] =  dtCustomers['customer_state'].copy()
In [9]:
dtCustomers = dtCustomers.replace({'regiao': regioes})
In [10]:
dtCustomers.head()
Out[10]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state regiao
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP Sudeste
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 09790 sao bernardo do campo SP Sudeste
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 01151 sao paulo SP Sudeste
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 08775 mogi das cruzes SP Sudeste
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP Sudeste
In [11]:
dtSellers['regiao'] =  dtSellers['seller_state'].copy()
In [12]:
dtSellers = dtCustomers.replace({'regiao': regioes})
In [13]:
dtSellers.head()
Out[13]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state regiao
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP Sudeste
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 09790 sao bernardo do campo SP Sudeste
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 01151 sao paulo SP Sudeste
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 08775 mogi das cruzes SP Sudeste
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP Sudeste
In [14]:
dtGeolocation['regiao'] =  dtGeolocation['geolocation_state'].copy()
In [15]:
dtGeolocation = dtGeolocation.replace({'regiao': regioes})
In [16]:
dtGeolocation.head()
Out[16]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state regiao
0 01037 -23.545621 -46.639292 sao paulo SP Sudeste
1 01046 -23.546081 -46.644820 sao paulo SP Sudeste
2 01046 -23.546129 -46.642951 sao paulo SP Sudeste
3 01041 -23.544392 -46.639499 sao paulo SP Sudeste
4 01035 -23.541578 -46.641607 sao paulo SP Sudeste

2.2 Construindo Datasets¶

Devido ao problema em questão ser dividido em 8 datasets diferentes, iremos construir datasets mais unificados afim de facilitar o processo de tratamento dos dados e clusterização.

In [17]:
dtCustomers.head()
Out[17]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state regiao
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP Sudeste
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 09790 sao bernardo do campo SP Sudeste
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 01151 sao paulo SP Sudeste
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 08775 mogi das cruzes SP Sudeste
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP Sudeste
In [18]:
dtGeolocation.head()
Out[18]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state regiao
0 01037 -23.545621 -46.639292 sao paulo SP Sudeste
1 01046 -23.546081 -46.644820 sao paulo SP Sudeste
2 01046 -23.546129 -46.642951 sao paulo SP Sudeste
3 01041 -23.544392 -46.639499 sao paulo SP Sudeste
4 01035 -23.541578 -46.641607 sao paulo SP Sudeste
In [19]:
dtOrderItems.head()
Out[19]:
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
In [20]:
dtOrderPayments.head()
Out[20]:
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
In [21]:
dtOrderReviews.head()
Out[21]:
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
0 7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 NaN NaN 2018-01-18 00:00:00 2018-01-18 21:46:59
1 80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 NaN NaN 2018-03-10 00:00:00 2018-03-11 03:05:13
2 228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 NaN NaN 2018-02-17 00:00:00 2018-02-18 14:36:24
3 e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 NaN Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
4 f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 NaN Parabéns lojas lannister adorei comprar pela I... 2018-03-01 00:00:00 2018-03-02 10:26:53
In [22]:
dtOrders.head()
Out[22]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
In [23]:
dtProducts.head()
Out[23]:
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
In [24]:
dtSellers.head()
Out[24]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state regiao
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP Sudeste
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 09790 sao bernardo do campo SP Sudeste
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 01151 sao paulo SP Sudeste
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 08775 mogi das cruzes SP Sudeste
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP Sudeste

2.2.1 Unificando Dataset de Pedidos¶

In [25]:
listDtMergeOrders = [dtOrderItems[['order_id', 'freight_value']], dtOrderPayments, dtOrderReviews, dtOrders,\
                    dtOrderItems.groupby('order_id').agg({'order_item_id': max}).reset_index()]
In [26]:
dtGeneralOrders = reduce(lambda left, right: pd.merge(left, right, on = 'order_id'), listDtMergeOrders)
In [27]:
dtGeneralOrders = pd.merge(dtGeneralOrders, dtCustomers[['customer_id', 'customer_unique_id', 'customer_city', 'customer_state', 'regiao']],\
                           on = 'customer_id')
In [28]:
dtGeneralOrders.head()
Out[28]:
order_id freight_value payment_sequential payment_type payment_installments payment_value review_id review_score review_comment_title review_comment_message ... order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_item_id customer_unique_id customer_city customer_state regiao
0 00010242fe8c5a6d1ba2dd792cb16214 13.29 1 credit_card 2 72.19 97ca439bc427b48bc1cd7177abe71365 5 NaN Perfeito, produto entregue antes do combinado. ... 2017-09-13 08:59:02 2017-09-13 09:45:35 2017-09-19 18:34:16 2017-09-20 23:43:48 2017-09-29 00:00:00 1 871766c5855e863f6eccc05f988b23cb campos dos goytacazes RJ Sudeste
1 00018f77f2f0320c557190d7a144bdd3 19.93 1 credit_card 3 259.83 7b07bacd811c4117b742569b04ce3580 4 NaN NaN ... 2017-04-26 10:53:06 2017-04-26 11:05:13 2017-05-04 14:35:00 2017-05-12 16:04:24 2017-05-15 00:00:00 1 eb28e67c4c0b83846050ddfb8a35d051 santa fe do sul SP Sudeste
2 000229ec398224ef6ca0657da4fc703e 17.87 1 credit_card 5 216.87 0c5b33dea94867d1ac402749e5438e8b 5 NaN Chegou antes do prazo previsto e o produto sur... ... 2018-01-14 14:33:31 2018-01-14 14:48:30 2018-01-16 12:36:48 2018-01-22 13:19:16 2018-02-05 00:00:00 1 3818d81c6709e39d06b2738a8d3a2474 para de minas MG Sudeste
3 00024acbcdf0a6daa1e931b038114c75 12.79 1 credit_card 2 25.78 f4028d019cb58564807486a6aaf33817 4 NaN NaN ... 2018-08-08 10:00:35 2018-08-08 10:10:18 2018-08-10 13:28:00 2018-08-14 13:32:39 2018-08-20 00:00:00 1 af861d436cfc08b2c2ddefd0ba074622 atibaia SP Sudeste
4 00042b26cf59d7ce69dfabb4e55b4fd9 18.14 1 credit_card 3 218.04 940144190dcba6351888cafa43f3a3a5 5 NaN Gostei pois veio no prazo determinado . ... 2017-02-04 13:57:51 2017-02-04 14:10:13 2017-02-16 09:46:09 2017-03-01 16:42:31 2017-03-17 00:00:00 1 64b576fb70d441e8f1b2d7d446e483c5 varzea paulista SP Sudeste

5 rows × 24 columns

In [29]:
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_purchase_timestamp'] = \
    pd.to_datetime(dtGeneralOrders['order_purchase_timestamp'])
In [30]:
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_delivered_customer_date'] = \
    pd.to_datetime(dtGeneralOrders['order_delivered_customer_date'])
In [31]:
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_estimated_delivery_date'] = \
    pd.to_datetime(dtGeneralOrders['order_estimated_delivery_date'])
In [32]:
dtGeneralOrders['diff_delivery_and_estimate'] = dtGeneralOrders['order_estimated_delivery_date'] -\
                                           dtGeneralOrders['order_delivered_customer_date']
In [33]:
dtGeneralOrders['diff_delivery_and_purchase'] = dtGeneralOrders['order_purchase_timestamp'] -\
                                           dtGeneralOrders['order_delivered_customer_date']
In [34]:
dtGeneralOrders['diff_delivery_and_estimate'] = \
                dtGeneralOrders['diff_delivery_and_estimate'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
In [35]:
dtGeneralOrders['diff_delivery_and_purchase'] = \
                dtGeneralOrders['diff_delivery_and_purchase'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
In [36]:
dtGeneralOrders['diff_delivery_and_estimate'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_estimate'])
In [37]:
dtGeneralOrders['diff_delivery_and_purchase'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_purchase'])
In [38]:
dtGeneralOrders.head()
Out[38]:
order_id freight_value payment_sequential payment_type payment_installments payment_value review_id review_score review_comment_title review_comment_message ... order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_item_id customer_unique_id customer_city customer_state regiao diff_delivery_and_estimate diff_delivery_and_purchase
0 00010242fe8c5a6d1ba2dd792cb16214 13.29 1 credit_card 2 72.19 97ca439bc427b48bc1cd7177abe71365 5 NaN Perfeito, produto entregue antes do combinado. ... 2017-09-19 18:34:16 2017-09-20 23:43:48 2017-09-29 1 871766c5855e863f6eccc05f988b23cb campos dos goytacazes RJ Sudeste 8001612.0 8091514.0
1 00018f77f2f0320c557190d7a144bdd3 19.93 1 credit_card 3 259.83 7b07bacd811c4117b742569b04ce3580 4 NaN NaN ... 2017-05-04 14:35:00 2017-05-12 16:04:24 2017-05-15 1 eb28e67c4c0b83846050ddfb8a35d051 santa fe do sul SP Sudeste 2075536.0 17184842.0
2 000229ec398224ef6ca0657da4fc703e 17.87 1 credit_card 5 216.87 0c5b33dea94867d1ac402749e5438e8b 5 NaN Chegou antes do prazo previsto e o produto sur... ... 2018-01-16 12:36:48 2018-01-22 13:19:16 2018-02-05 1 3818d81c6709e39d06b2738a8d3a2474 para de minas MG Sudeste 13104044.0 8011415.0
3 00024acbcdf0a6daa1e931b038114c75 12.79 1 credit_card 2 25.78 f4028d019cb58564807486a6aaf33817 4 NaN NaN ... 2018-08-10 13:28:00 2018-08-14 13:32:39 2018-08-20 1 af861d436cfc08b2c2ddefd0ba074622 atibaia SP Sudeste 5102721.0 7202756.0
4 00042b26cf59d7ce69dfabb4e55b4fd9 18.14 1 credit_card 3 218.04 940144190dcba6351888cafa43f3a3a5 5 NaN Gostei pois veio no prazo determinado . ... 2017-02-16 09:46:09 2017-03-01 16:42:31 2017-03-17 1 64b576fb70d441e8f1b2d7d446e483c5 varzea paulista SP Sudeste 15071729.0 26211520.0

5 rows × 26 columns

In [39]:
dtGeneralOrders = dtGeneralOrders[['customer_unique_id', 'order_id', 'order_purchase_timestamp', 'freight_value', 'payment_type', 'payment_installments', 'payment_value',\
                                 'review_score', 'order_status', 'order_item_id',\
                                  'diff_delivery_and_estimate', 'diff_delivery_and_purchase',\
                                  'customer_state', 'regiao']]
In [40]:
dtGeneralOrders.head()
Out[40]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score order_status order_item_id diff_delivery_and_estimate diff_delivery_and_purchase customer_state regiao
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 13.29 credit_card 2 72.19 5 delivered 1 8001612.0 8091514.0 RJ Sudeste
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 19.93 credit_card 3 259.83 4 delivered 1 2075536.0 17184842.0 SP Sudeste
2 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 17.87 credit_card 5 216.87 5 delivered 1 13104044.0 8011415.0 MG Sudeste
3 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 12.79 credit_card 2 25.78 4 delivered 1 5102721.0 7202756.0 SP Sudeste
4 64b576fb70d441e8f1b2d7d446e483c5 00042b26cf59d7ce69dfabb4e55b4fd9 2017-02-04 13:57:51 18.14 credit_card 3 218.04 5 delivered 1 15071729.0 26211520.0 SP Sudeste
In [41]:
dtGeneralOrders = dtGeneralOrders[dtGeneralOrders['order_status'] == 'delivered']
In [42]:
print(dtGeneralOrders.isna().sum())
customer_unique_id            0
order_id                      0
order_purchase_timestamp      0
freight_value                 0
payment_type                  0
payment_installments          0
payment_value                 0
review_score                  0
order_status                  0
order_item_id                 0
diff_delivery_and_estimate    8
diff_delivery_and_purchase    8
customer_state                0
regiao                        0
dtype: int64
In [43]:
# Dropando valores NA
dtGeneralOrders = dtGeneralOrders.dropna()
In [44]:
dtGeneralOrders = dtGeneralOrders.drop('order_status', axis = 1)
In [45]:
dtGeneralOrders.shape
Out[45]:
(115720, 13)
In [46]:
dtGeneralOrders.head()
Out[46]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score order_item_id diff_delivery_and_estimate diff_delivery_and_purchase customer_state regiao
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 13.29 credit_card 2 72.19 5 1 8001612.0 8091514.0 RJ Sudeste
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 19.93 credit_card 3 259.83 4 1 2075536.0 17184842.0 SP Sudeste
2 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 17.87 credit_card 5 216.87 5 1 13104044.0 8011415.0 MG Sudeste
3 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 12.79 credit_card 2 25.78 4 1 5102721.0 7202756.0 SP Sudeste
4 64b576fb70d441e8f1b2d7d446e483c5 00042b26cf59d7ce69dfabb4e55b4fd9 2017-02-04 13:57:51 18.14 credit_card 3 218.04 5 1 15071729.0 26211520.0 SP Sudeste
In [47]:
dtGeneralOrders = dtGeneralOrders.rename(columns = {'order_item_id': 'quantity', 'regiao': 'customer_region'})
In [48]:
dtGeneralOrders.head()
Out[48]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score quantity diff_delivery_and_estimate diff_delivery_and_purchase customer_state customer_region
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 13.29 credit_card 2 72.19 5 1 8001612.0 8091514.0 RJ Sudeste
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 19.93 credit_card 3 259.83 4 1 2075536.0 17184842.0 SP Sudeste
2 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 17.87 credit_card 5 216.87 5 1 13104044.0 8011415.0 MG Sudeste
3 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 12.79 credit_card 2 25.78 4 1 5102721.0 7202756.0 SP Sudeste
4 64b576fb70d441e8f1b2d7d446e483c5 00042b26cf59d7ce69dfabb4e55b4fd9 2017-02-04 13:57:51 18.14 credit_card 3 218.04 5 1 15071729.0 26211520.0 SP Sudeste
In [49]:
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
                         'diff_delivery_and_estimate', 'diff_delivery_and_purchase']
In [50]:
def outlier_detect(x, alpha = 0.05, nome = 'X'):
    n = len(x)
    mean_x = np.mean(x)
    std_x = np.std(x)
    numerator = max(abs(x - mean_x))
    
    # Grubbs test
    grubs_calculated = numerator / std_x
    
    # Valor calculado
    print('Grubs Test:', grubs_calculated)
    
    # Teste T
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    
    # Formula de Grubbs
    grubs_critical = ((n - 1) * np.sqrt(np.square(t_value))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value)))
    
    # Valor Critico
    print('Grubbs Valor Critico:', grubs_critical)
    
    if grubs_critical > grubs_calculated:
        print('Para a variavel', nome, 'o valor calculado é MENOR que o valor critico. Aceitamos a Hipotese de que NÃO há outliers.')
    else:
        print('Para a variavel', nome, 'o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.')
        
    print('\n')
In [51]:
for col in colunas_quantitativas:
    outlier_detect(dtGeneralOrders[col], nome = col)
Grubs Test: 24.79253432170545
Grubbs Valor Critico: 5.053985479091401
Para a variavel freight_value o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.


Grubs Test: 50.79085515326686
Grubbs Valor Critico: 5.053985479091401
Para a variavel payment_value o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.


Grubs Test: 17.61228664665197
Grubbs Valor Critico: 5.053985479091401
Para a variavel quantity o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.


Grubs Test: 21.556369925308974
Grubbs Valor Critico: 5.053985479091401
Para a variavel diff_delivery_and_estimate o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.


Grubs Test: 20.82824266694584
Grubbs Valor Critico: 5.053985479091401
Para a variavel diff_delivery_and_purchase o valor calculado é MAIOR que o valor critico. Aceitamos a Hipotese de que HÁ há outliers.


2.3 Preparando os Dados para Clusterização¶

2.3.1 Feature Engineering¶

Para essa etapa será realizado a criação de variáveis RFM para melhor analise do cluster.

  • Recência: Como recentemente o cliente fez uma transação conosco
  • Frequência: Quão frequente é o cliente em encomendar/comprar algum produto de nós
  • Monetário: Quanto o cliente gasta na compra de produtos de nós.
In [52]:
Rweight = 100.0
Fweight = 1.0
Mweight = 10.0
In [53]:
def r_score(x):
    if x <= quintiles['Recency'][.2]:
        return 5
    elif x <= quintiles['Recency'][.4]:
        return 4
    elif x <= quintiles['Recency'][.6]:
        return 3
    elif x <= quintiles['Recency'][.8]:
        return 2
    else:
        return 1

def fm_score(x, c):
    if x <= quintiles[c][.2]:
        return 1
    elif x <= quintiles[c][.4]:
        return 2
    elif x <= quintiles[c][.6]:
        return 3
    elif x <= quintiles[c][.8]:
        return 4
    else:
        return 5    
    
segt_map = {
    r'[1-2][1-2]': 'Hibernando',
    r'[1-2][3-4]': 'Em risco',
    r'[1-2]5': 'Não pode perder',
    r'3[1-2]': 'Quase dormindo',
    r'33': 'Precisam de atenção',
    r'[3-4][4-5]': 'Clientes Leais',
    r'41': 'Promissores',
    r'51': 'Novos Clientes',
    r'[4-5][2-3]': 'Potenciais Clientes Leais',
    r'5[4-5]': 'Campeões'
}
In [54]:
# Calcular Recência
dtRFM_R = dtGeneralOrders.groupby(by = 'customer_unique_id', as_index = False)['order_purchase_timestamp'].max()
dtRFM_R.rename(columns = {"order_purchase_timestamp": "LastPurchaseDate"}, inplace = True)
In [55]:
# Mantem somente yyyy-MM-dd
dtRFM_R["LastPurchaseDate"] = dtRFM_R["LastPurchaseDate"].dt.date
In [56]:
# Get recent invoice date and use it to calculate recency
mostRecentDate = dtGeneralOrders['order_purchase_timestamp'].dt.date.max()
dtRFM_R['Recency'] = dtRFM_R['LastPurchaseDate'].apply(lambda x: (mostRecentDate - x).days)
In [57]:
# Calculando frequência
dtRFM_F = dtGeneralOrders.groupby(["customer_unique_id"]).agg({"order_id":"nunique"}).reset_index()
dtRFM_F.rename(columns = {"order_id":"Frequency"}, inplace = True)
In [58]:
# Calculando valor monetário
dtRFM_M = dtGeneralOrders.groupby('customer_unique_id', as_index = False)['payment_value'].sum()
dtRFM_M.columns = ['customer_unique_id', 'Monetary']
In [59]:
# Merging dfs
dtRFM = dtRFM_R.merge(dtRFM_F, on = 'customer_unique_id')
dtRFM = dtRFM.merge(dtRFM_M, on = 'customer_unique_id').drop(columns = 'LastPurchaseDate')
In [60]:
dtRFM.head()
Out[60]:
customer_unique_id Recency Frequency Monetary
0 0000366f3b9a7992bf8c76cfdf3221e2 111 1 141.90
1 0000b849f77a49e4a4ce2b2a4ca5be3f 114 1 27.19
2 0000f46a3911fa3c0805444483337064 537 1 86.22
3 0000f6ccb0745a6a4b88665a16c9f078 321 1 43.62
4 0004aac84e0df4da2b147fca70cf8255 288 1 196.89
In [61]:
dtRFM.isna().sum()
Out[61]:
customer_unique_id    0
Recency               0
Frequency             0
Monetary              0
dtype: int64
In [62]:
quintiles = dtRFM[['Recency', 'Frequency', 'Monetary']].quantile([.2, .4, .6, .8]).to_dict()
In [63]:
dtRFM['R'] = dtRFM['Recency'].apply(lambda x: r_score(x))
dtRFM['F'] = dtRFM['Frequency'].apply(lambda x: fm_score(x, 'Frequency'))
dtRFM['M'] = dtRFM['Monetary'].apply(lambda x: fm_score(x, 'Monetary'))
In [64]:
dtRFM['Linear_Segment'] = dtRFM['R'].map(str) + dtRFM['F'].map(str) + dtRFM['M'].map(str)
In [65]:
dtRFM['RFM_Score'] = ((dtRFM['R'] * Rweight) + (dtRFM['F'] * Fweight) + (dtRFM['M'] * Mweight)) / (Rweight + Fweight + Mweight)
In [66]:
dtRFM['SegmentByR&F'] = dtRFM['R'].map(str) + dtRFM['F'].map(str)
dtRFM['SegmentByR&F'] = dtRFM['SegmentByR&F'].replace(segt_map, regex = True)
In [67]:
dtRFMGrouped = dtRFM.groupby('SegmentByR&F').count()['customer_unique_id'].reset_index()
In [68]:
dtRFMGrouped
Out[68]:
SegmentByR&F customer_unique_id
0 Campeões 627
1 Clientes Leais 1192
2 Hibernando 36091
3 Novos Clientes 18169
4 Não pode perder 982
5 Promissores 18170
6 Quase dormindo 18118
In [69]:
fig = px.treemap(dtRFMGrouped, path = ['SegmentByR&F'], 
                 values = 'customer_unique_id')
fig.data[0].textinfo = 'label + text + value + percent parent'

fig.show()
In [70]:
dtRFM["SegmentByRFMScore"] = np.where(dtRFM['RFM_Score'] > 4.5, "Campeões", (np.where(dtRFM['RFM_Score'] > 4, "Alto Valor", \
                             (np.where(dtRFM['RFM_Score'] > 3, "Valor Médio", np.where(dtRFM['RFM_Score'] > 1.6, \
                              'Baixo Valor', 'Perdido'))))))
In [71]:
dtRFMGrouped2 = dtRFM.groupby('SegmentByRFMScore').count()['customer_unique_id'].reset_index()
In [72]:
fig = px.treemap(dtRFMGrouped2, path = ['SegmentByRFMScore'], 
                 values = 'customer_unique_id')
fig.data[0].textinfo = 'label + text + value + percent parent'

fig.show()
In [73]:
dtGeneralOrders = dtGeneralOrders.merge(dtRFM, on = 'customer_unique_id')

2.3.1 LabelEncoder¶

In [74]:
dtGeneralOrders.head()
Out[74]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score quantity diff_delivery_and_estimate ... Recency Frequency Monetary R F M Linear_Segment RFM_Score SegmentByR&F SegmentByRFMScore
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 13.29 credit_card 2 72.19 5 1 8001612.0 ... 350 1 72.19 2 1 2 212 1.990991 Hibernando Baixo Valor
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 19.93 credit_card 3 259.83 4 1 2075536.0 ... 345 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor
2 eb28e67c4c0b83846050ddfb8a35d051 c105be01f21c9df6a2579f59b95cbcd9 2017-09-18 16:12:04 11.85 credit_card 1 24.73 5 1 8020014.0 ... 345 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor
3 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 17.87 credit_card 5 216.87 5 1 13104044.0 ... 227 1 216.87 3 1 4 314 3.072072 Quase dormindo Valor Médio
4 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 12.79 credit_card 2 25.78 4 1 5102721.0 ... 21 1 25.78 5 1 1 511 4.603604 Novos Clientes Campeões

5 rows × 23 columns

In [75]:
dtGeneralOrdersProcessado = dtGeneralOrders.copy()
In [76]:
# Realizando LabelEncoder
labelencoderPaymentType = LabelEncoder()
labelencoderCustomerState = LabelEncoder()
labelencoderCustomerRegion = LabelEncoder()

dtGeneralOrdersProcessado['payment_type'] = labelencoderPaymentType.fit_transform(dtGeneralOrders['payment_type'])

dtGeneralOrdersProcessado['customer_state'] =\
                labelencoderCustomerState.fit_transform(dtGeneralOrders['customer_state'])

dtGeneralOrdersProcessado['customer_region'] =\
                labelencoderCustomerRegion.fit_transform(dtGeneralOrders['customer_region'])
In [77]:
dtGeneralOrdersProcessado.head()
Out[77]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score quantity diff_delivery_and_estimate ... Recency Frequency Monetary R F M Linear_Segment RFM_Score SegmentByR&F SegmentByRFMScore
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 13.29 1 2 72.19 5 1 8001612.0 ... 350 1 72.19 2 1 2 212 1.990991 Hibernando Baixo Valor
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 19.93 1 3 259.83 4 1 2075536.0 ... 345 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor
2 eb28e67c4c0b83846050ddfb8a35d051 c105be01f21c9df6a2579f59b95cbcd9 2017-09-18 16:12:04 11.85 1 1 24.73 5 1 8020014.0 ... 345 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor
3 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 17.87 1 5 216.87 5 1 13104044.0 ... 227 1 216.87 3 1 4 314 3.072072 Quase dormindo Valor Médio
4 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 12.79 1 2 25.78 4 1 5102721.0 ... 21 1 25.78 5 1 1 511 4.603604 Novos Clientes Campeões

5 rows × 23 columns

2.3.2 Normalização / Padronização¶

In [78]:
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
                         'diff_delivery_and_estimate', 'diff_delivery_and_purchase', 'Recency',\
                         'Frequency', 'Monetary', 'RFM_Score']
In [79]:
def boxplot_individuais(data, columns, width = 15, height = 8):
    fig = plt.figure()
    fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
    fig.set_figheight(8)
    fig.set_figwidth(15)
    
    columns_adjust = ceil(len(columns) / 3)
    
    for i, column in enumerate(columns):
        ax = fig.add_subplot(columns_adjust, 3, i + 1)
        sns.boxplot(x = data[column])
        
    plt.tight_layout()
    plt.show()
In [80]:
def hist_individual(data, columns, width = 10, height = 15):
    fig = plt.figure()
    fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
    fig.set_figheight(10)
    fig.set_figwidth(15)
    
    columns_adjust = ceil(len(columns) / 3)
    
    for i, column in enumerate(columns):
        ax = fig.add_subplot(columns_adjust, 3, i + 1)
        data[column].hist(label = column)
        plt.title(column)
        
    plt.tight_layout()  
    plt.show()
In [81]:
boxplot_individuais(dtGeneralOrdersProcessado, colunas_quantitativas)
In [82]:
hist_individual(dtGeneralOrdersProcessado, colunas_quantitativas)
In [83]:
scaler = StandardScaler()
#scaler = MinMaxScaler()
In [84]:
dtGeneralOrdersNormalizado = dtGeneralOrdersProcessado.copy()
In [85]:
dtGeneralOrdersNormalizado[colunas_quantitativas] = scaler.fit_transform(dtGeneralOrdersProcessado[colunas_quantitativas])
In [86]:
dtGeneralOrdersNormalizado.head()
Out[86]:
customer_unique_id order_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score quantity diff_delivery_and_estimate ... Recency Frequency Monetary R F M Linear_Segment RFM_Score SegmentByR&F SegmentByRFMScore
0 871766c5855e863f6eccc05f988b23cb 00010242fe8c5a6d1ba2dd792cb16214 2017-09-13 08:59:02 -0.425764 1 2 -0.375019 5 -0.353985 -0.568903 ... 0.733304 -0.223342 -0.184848 2 1 2 212 -0.793074 Hibernando Baixo Valor
1 eb28e67c4c0b83846050ddfb8a35d051 00018f77f2f0320c557190d7a144bdd3 2017-04-26 10:53:06 -0.003328 1 3 0.331341 4 -0.353985 -1.293293 ... 0.700521 2.322229 -0.058797 2 5 5 255 -0.554388 Não pode perder Baixo Valor
2 eb28e67c4c0b83846050ddfb8a35d051 c105be01f21c9df6a2579f59b95cbcd9 2017-09-18 16:12:04 -0.517377 1 1 -0.553679 5 -0.353985 -0.566653 ... 0.700521 2.322229 -0.058797 2 5 5 255 -0.554388 Não pode perder Baixo Valor
3 3818d81c6709e39d06b2738a8d3a2474 000229ec398224ef6ca0657da4fc703e 2018-01-14 14:33:31 -0.134385 1 5 0.169621 5 -0.353985 0.054807 ... -0.073163 -0.223342 -0.098974 3 1 4 314 0.049347 Quase dormindo Valor Médio
4 af861d436cfc08b2c2ddefd0ba074622 00024acbcdf0a6daa1e931b038114c75 2018-08-08 10:00:35 -0.457574 1 2 -0.549727 4 -0.353985 -0.923257 ... -1.423831 -0.223342 -0.212395 5 1 1 511 1.242777 Novos Clientes Campeões

5 rows × 23 columns

In [87]:
boxplot_individuais(dtGeneralOrdersNormalizado, colunas_quantitativas)
In [88]:
hist_individual(dtGeneralOrdersNormalizado, colunas_quantitativas)

2.3.3 Simetria dos Dados¶

2.3.3.1 Skewness¶

  • Se skewness é menor que −1 ou maior que +1, a distribuição é 'highly skewed'.
  • Se skewness esta entre −1 e −½ ou entre +½ e +1, a distribuição é 'moderately skewed'.
  • Se skewness esta entre −½ e +½, a distribuição é aproximadaente simetrica.
In [89]:
print(dtGeneralOrdersProcessado[colunas_quantitativas].skew(),\
      '\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].skew())))
freight_value                  5.569890
payment_value                 14.411085
quantity                       6.339456
diff_delivery_and_estimate     2.932376
diff_delivery_and_purchase     3.844929
Recency                        0.447031
Frequency                     11.008832
Monetary                      29.531411
RFM_Score                     -0.003545
dtype: float64 
Soma: 74.08855421395997
In [90]:
print(dtGeneralOrdersNormalizado[colunas_quantitativas].skew(),\
      '\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].skew())))
freight_value                  5.569890
payment_value                 14.411085
quantity                       6.339456
diff_delivery_and_estimate     2.932376
diff_delivery_and_purchase     3.844929
Recency                        0.447031
Frequency                     11.008832
Monetary                      29.531411
RFM_Score                     -0.003545
dtype: float64 
Soma: 74.08855421395998

2.3.3.2 Kurtosis¶

  • Mesokurtic -> Kurtosis ~= 0: Distribuição normal.
  • Leptokurtic -> Kurtosis > 0: Valores proximos a media ou dos extremos.
  • Platykurtic -> Kurtosis < 0: Valores muito espalhados.
In [91]:
print(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis(),\
      '\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis())))
freight_value                   58.620450
payment_value                  528.826822
quantity                        68.674498
diff_delivery_and_estimate      33.761390
diff_delivery_and_purchase      39.948358
Recency                         -0.656545
Frequency                      258.347088
Monetary                      1432.134347
RFM_Score                       -1.271986
dtype: float64 
Soma: 2422.2414848619846
In [92]:
print(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis(),\
      '\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis())))
freight_value                   58.620450
payment_value                  528.826822
quantity                        68.674498
diff_delivery_and_estimate      33.761390
diff_delivery_and_purchase      39.948358
Recency                         -0.656545
Frequency                      258.347088
Monetary                      1432.134347
RFM_Score                       -1.271986
dtype: float64 
Soma: 2422.2414848619874

2.3.3 Correlação¶

In [93]:
def scatter_plot_conjunto(data, columns, target, rowN = 2, colN = 2):
    
    col_nums = 1
    row_ = 0
    col_ = 0
    
    # Definindo range de Y
    y_range = [data[target].min(), data[target].max()]
    
    fig, axes = plt.subplots(nrows = rowN, ncols = colN, figsize=(12, 12))
    
    for column in columns:
        if target != column:
            # Definindo range de X
            x_range = [data[column].min(), data[column].max()]
            
            # Scatter plot de X e Y
            scatter_plot = data.plot(kind = 'scatter', x = column, y = target, xlim = x_range, ylim = y_range, ax = axes[row_, col_])
            
            # Traçar linha da media de X e Y
            meanX = scatter_plot.plot(x_range, [data[target].mean(), data[target].mean()], '--', color = 'red', linewidth = 1)
            meanY = scatter_plot.plot([data[column].mean(), data[column].mean()], y_range, '--', color = 'red', linewidth = 1)
            
            col_ += 1
            if col_ > col_nums:
                col_ = 0
                row_ += 1
In [94]:
plt.figure(figsize = (12, 12))
sns.heatmap(dtGeneralOrdersProcessado[colunas_quantitativas].corr(), annot =  True)
plt.show()
In [95]:
scatter_plot_conjunto(dtGeneralOrdersProcessado, colunas_quantitativas, 'freight_value', rowN = 4)

2.4 Analise de Clusterização¶

In [96]:
dtGeneralOrdersFinal = dtGeneralOrdersNormalizado.drop(
    ['order_id', 'customer_unique_id', 'order_purchase_timestamp', 'Recency', 'Frequency', 'Monetary', 'R', 'F', 'M',\
    'Linear_Segment', 'SegmentByR&F', 'SegmentByRFMScore', 'diff_delivery_and_estimate', 'diff_delivery_and_purchase'], axis = 1)
In [97]:
dtGeneralOrdersFinal.head()
Out[97]:
freight_value payment_type payment_installments payment_value review_score quantity customer_state customer_region RFM_Score
0 -0.425764 1 2 -0.375019 5 -0.353985 18 3 -0.793074
1 -0.003328 1 3 0.331341 4 -0.353985 25 3 -0.554388
2 -0.517377 1 1 -0.553679 5 -0.353985 25 3 -0.554388
3 -0.134385 1 5 0.169621 5 -0.353985 10 3 0.049347
4 -0.457574 1 2 -0.549727 4 -0.353985 25 3 1.242777
In [98]:
dtGeneralOrdersFinal[dtGeneralOrdersFinal.columns.difference(colunas_quantitativas)].head()
Out[98]:
customer_region customer_state payment_installments payment_type review_score
0 3 18 2 1 5
1 3 25 3 1 4
2 3 25 1 1 5
3 3 10 5 1 5
4 3 25 2 1 4

2.4.1 Elbow¶

In [99]:
# Determinando um range de K
k_range = range(1, 12)
In [100]:
# Aplicando o modelo K-Means para cada valor de K
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersFinal) for k in k_range]
In [101]:
# Ajustando o centróide do cluster para cada modelo
intertia = [X.inertia_ for X in k_means_var]

Analisando a metrica de Elbow é verificado que o numero ideal para K seria 6. Pois, a queda da distancia passa a diminuir fortemente em K = 4, e em 6 tem uma redução brusca novamente.

In [102]:
# Lista de valores de Inertia (Inertia e WCSS são a mesma coisa)
inertia = []

# Loop para testar os valores de K
for n in range(2 , 12):
    modelo = (KMeans(n_clusters = n,
                     init = 'k-means++', 
                     n_init = 10,
                     max_iter = 300,
                     tol = 0.0001,  
                     random_state = seed_, 
                     algorithm = 'elkan'))
    modelo.fit(dtGeneralOrdersFinal)
    inertia.append(modelo.inertia_)

# Plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 12) , inertia , 'o')
plt.plot(np.arange(2 , 12) , inertia , '-' , alpha = 0.5)
plt.xlabel('Número de Clusters') , plt.ylabel('Inertia')
plt.show()
In [103]:
dtGeneralOrdersTemp = dtGeneralOrdersFinal[:57028].copy()
In [104]:
# Determinando um range de K
k_range = range(1, 10)
In [105]:
# Aplicando o modelo K-Means para cada valor de K (esta célula pode levar bastante tempo para ser executada)
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersTemp) for k in k_range]
In [106]:
# Ajustando o centróide do cluster para cada modelo
centroids = [X.cluster_centers_ for X in k_means_var]
In [107]:
# Calculando a distância euclidiana de cada ponto de dado para o centróide
k_euclid = [cdist(dtGeneralOrdersTemp, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke, axis = 1) for ke in k_euclid]
In [108]:
# Soma dos quadrados das distâncias dentro do cluster
soma_quadrados_intra_cluster = [sum(d**2) for d in dist]
In [109]:
# Soma total dos quadrados
soma_total = sum(pdist(dtGeneralOrdersTemp)**2)/dtGeneralOrdersTemp.shape[0]
In [110]:
# Soma dos quadrados entre clusters
soma_quadrados_inter_cluster = soma_total - soma_quadrados_intra_cluster

Novamente verificando a variancia dos dados de acordo com cada valor de K, encontramos que o valor ideal é 6.

In [111]:
# Curva de Elbow
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(k_range, soma_quadrados_inter_cluster/soma_total * 100, 'b*-')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Número de Clusters')
plt.ylabel('Percentual de Variância Explicada')
plt.title('Variância Explicada x Valor de K')
plt.savefig('export/variancia_explicada.png')
plt.show()

2.4.2 Modelando Clusters KMeans - Com Redução de Componentes¶

In [112]:
def testa_k(pca_data, k = 2):
    # Criando um modelo
    modelo = KMeans(n_clusters = k)
    modelo.fit(pca_data)
    
    # Obtém os valores mínimos e máximos e organiza o shape
    x_min, x_max = pca_data[:, 0].min() - 5, pca_data[:, 0].max() - 1
    y_min, y_max = pca_data[:, 1].min() + 1, pca_data[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
    Z = modelo.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot das áreas dos clusters
    plt.figure(figsize = (8, 8))
    plt.clf()
    plt.imshow(Z, 
               interpolation = 'nearest',
               extent = (xx.min(), xx.max(), yy.min(), yy.max()),
               cmap = plt.cm.Paired,
               aspect = 'auto', 
               origin = 'lower')
    plt.savefig('export/meshgrid_' + str(k) + 'k.png')
    plt.show()
    
    # Plot dos centróides
    plt.figure(figsize = (13, 13))
    plt.plot(pca_data[:, 0], pca_data[:, 1], 'k.', markersize = 4)
    centroids = modelo.cluster_centers_
    inert = modelo.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x', s = 600, linewidths = 100, color = 'r', zorder = 8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.savefig('export/centroides_' + str(k) + 'k.png')
    plt.show()
    
    # Silhouette Score
    labels = modelo.labels_ 
    print(f"Silhouette Score: {silhouette_score(pca_data, labels, metric = 'euclidean', random_state = seed_)}")
In [113]:
# Aplica redução de dimensionalidade
pca = PCA(n_components = 2).fit_transform(dtGeneralOrdersFinal)
In [114]:
testa_k(pca, 4)
Silhouette Score: 0.6093958650516936
In [115]:
testa_k(pca, 5)
Silhouette Score: 0.6026403332796946
In [116]:
testa_k(pca, 6)
Silhouette Score: 0.6113627678449342
In [117]:
testa_k(pca, 8)
Silhouette Score: 0.5798900430267735
In [118]:
testa_k(pca, 10)
Silhouette Score: 0.6118385309447221

2.4.3 Modelando Clusters KMeans - Sem Redução de Componentes¶

In [119]:
def treina_modelo(k, data, silhouette = True):
    # Criando um modelo
    modelo = KMeans(n_clusters = k)
    
    modelo.fit(data)
    
    if silhouette:
        # Silhouette Score
        labels = modelo.labels_ 
        print(f"Silhouette Score: {silhouette_score(data, labels, metric = 'euclidean', random_state = seed_)}")
In [120]:
# Modelo com k = 4
treina_modelo(4, dtGeneralOrdersFinal)
Silhouette Score: 0.4455774376463494
In [121]:
# Modelo com k = 5
treina_modelo(5, dtGeneralOrdersFinal)
Silhouette Score: 0.4307782879947419
In [122]:
# Modelo com k = 6
treina_modelo(6, dtGeneralOrdersFinal)
Silhouette Score: 0.4051423836510225
In [123]:
# Modelo com k = 8
treina_modelo(8, dtGeneralOrdersFinal)
Silhouette Score: 0.3257378455573458

2.4.4 Analise de Clusters - 1¶

In [124]:
# Criando um modelo
modelo = KMeans(n_clusters = 4, random_state = seed_)
modelo.fit(dtGeneralOrdersFinal)
Out[124]:
KMeans(n_clusters=4, random_state=194)
In [125]:
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
In [126]:
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.drop('order_id', axis = 1)
In [127]:
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
In [128]:
dtGeneralOrdersClusterizado['payment_type'] =\
                    labelencoderPaymentType.inverse_transform(dtGeneralOrdersClusterizado['payment_type'])
In [129]:
dtGeneralOrdersClusterizado['customer_state'] =\
                    labelencoderCustomerState.inverse_transform(dtGeneralOrdersClusterizado['customer_state'])
In [130]:
dtGeneralOrdersClusterizado['customer_region'] =\
                    labelencoderCustomerRegion.inverse_transform(dtGeneralOrdersClusterizado['customer_region'])
In [131]:
metodos_pagamento = {
    'credit_card': 'Crédito',
    'debit_card': 'Débito'
}
In [132]:
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.replace({'payment_type': metodos_pagamento})
In [133]:
dtGeneralOrdersClusterizado.head()
Out[133]:
customer_unique_id order_purchase_timestamp freight_value payment_type payment_installments payment_value review_score quantity diff_delivery_and_estimate diff_delivery_and_purchase ... Frequency Monetary R F M Linear_Segment RFM_Score SegmentByR&F SegmentByRFMScore cluster
0 871766c5855e863f6eccc05f988b23cb 2017-09-13 08:59:02 13.29 Crédito 2 72.19 5 1 8001612.0 8091514.0 ... 1 72.19 2 1 2 212 1.990991 Hibernando Baixo Valor 0
1 eb28e67c4c0b83846050ddfb8a35d051 2017-04-26 10:53:06 19.93 Crédito 3 259.83 4 1 2075536.0 17184842.0 ... 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor 1
2 eb28e67c4c0b83846050ddfb8a35d051 2017-09-18 16:12:04 11.85 Crédito 1 24.73 5 1 8020014.0 8181218.0 ... 2 284.56 2 5 5 255 2.297297 Não pode perder Baixo Valor 1
3 3818d81c6709e39d06b2738a8d3a2474 2018-01-14 14:33:31 17.87 Crédito 5 216.87 5 1 13104044.0 8011415.0 ... 1 216.87 3 1 4 314 3.072072 Quase dormindo Valor Médio 2
4 af861d436cfc08b2c2ddefd0ba074622 2018-08-08 10:00:35 12.79 Crédito 2 25.78 4 1 5102721.0 7202756.0 ... 1 25.78 5 1 1 511 4.603604 Novos Clientes Campeões 1

5 rows × 23 columns

In [134]:
dtClusters = pd.DataFrame()
In [135]:
dtClusters['Preço'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_value'].mean(), 2)
In [136]:
dtClusters['Frete'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['freight_value'].mean(), 2)
In [137]:
dtClusters['Parcelas'] =\
                round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_installments'].mean(), 2)
In [138]:
dtClusters['Score'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['review_score'].mean(), 2)
In [139]:
dtClusters['Itens'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['quantity'].mean(), 2)
In [140]:
dtClusters['RFM'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['RFM_Score'].mean(), 2)
In [141]:
dtClusters['Pagamento'] =\
                dtGeneralOrdersClusterizado.groupby('cluster')['payment_type'].agg(lambda x: x.value_counts().index[0])
In [142]:
dtClusters['Estado'] =\
                dtGeneralOrdersClusterizado.groupby('cluster')['customer_state'].agg(lambda x: x.value_counts().index[0])
In [143]:
dtClusters['Região'] =\
                dtGeneralOrdersClusterizado.groupby('cluster')['customer_region'].agg(lambda x: x.value_counts().index[0])
In [144]:
dtClusters
Out[144]:
Preço Frete Parcelas Score Itens RFM Pagamento Estado Região
cluster
0 165.98 23.13 2.40 3.96 1.33 2.95 Crédito RJ Sudeste
1 130.20 15.62 1.77 4.16 1.37 3.05 Crédito SP Sudeste
2 187.74 23.64 3.16 4.01 1.38 2.97 Crédito MG Sudeste
3 320.17 23.27 8.41 3.98 1.67 3.03 Crédito SP Sudeste
In [145]:
dtClusters.to_csv('analise_clusters_geral.csv')
In [146]:
dtClustersIndexado = dtClusters.reset_index()
In [147]:
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
In [148]:
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
In [149]:
dtClustersIndexado
Out[149]:
Cluster Preço Frete Parcelas Score Itens RFM Pagamento Estado Região
0 Cluster 0 165.98 23.13 2.40 3.96 1.33 2.95 Crédito RJ Sudeste
1 Cluster 1 130.20 15.62 1.77 4.16 1.37 3.05 Crédito SP Sudeste
2 Cluster 2 187.74 23.64 3.16 4.01 1.38 2.97 Crédito MG Sudeste
3 Cluster 3 320.17 23.27 8.41 3.98 1.67 3.03 Crédito SP Sudeste
In [150]:
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'

fig = go.Figure(data=[go.Table(
  header=dict(
    values=list(dtClustersIndexado.columns),
    line_color='darkslategray',
    fill_color=headerColor,
    align=['left','center'],
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=[dtClustersIndexado.Cluster, dtClustersIndexado['Preço'], dtClustersIndexado.Frete, dtClustersIndexado.Parcelas,\
       dtClustersIndexado.Score, dtClustersIndexado.Itens, dtClustersIndexado.RFM,\
       dtClustersIndexado.Pagamento, dtClustersIndexado.Estado,\
       dtClustersIndexado['Região']],
    line_color='darkslategray',
    
    fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
    align = ['left', 'center'],
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.show()

2.4.5 Analise de Clusters - 2¶

In [151]:
dtGeneralOrdersCluster2 = dtGeneralOrdersNormalizado[['Recency', 'Frequency', 'Monetary', 'RFM_Score']]
In [152]:
dtGeneralOrdersCluster2.head()
Out[152]:
Recency Frequency Monetary RFM_Score
0 0.733304 -0.223342 -0.184848 -0.793074
1 0.700521 2.322229 -0.058797 -0.554388
2 0.700521 2.322229 -0.058797 -0.554388
3 -0.073163 -0.223342 -0.098974 0.049347
4 -1.423831 -0.223342 -0.212395 1.242777
In [153]:
# Criando um modelo
modelo = KMeans(n_clusters = 4, random_state = seed_)
modelo.fit(dtGeneralOrdersCluster2)
Out[153]:
KMeans(n_clusters=4, random_state=194)
In [154]:
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
In [155]:
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
In [156]:
dtClusters = pd.DataFrame()
In [157]:
dtClusters['Recency'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Recency'].mean(), 2)
In [158]:
dtClusters['Frequency'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Frequency'].mean(), 2)
In [159]:
dtClusters['Monetary'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Monetary'].mean(), 2)
In [160]:
dtClusters['RFM'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['RFM_Score'].mean(), 2)
In [161]:
dtClusters
Out[161]:
Recency Frequency Monetary RFM
cluster
0 395.67 1.00 296.31 1.64
1 133.88 1.00 296.32 3.90
2 221.00 2.23 682.74 3.27
3 221.85 1.49 26987.42 3.20
In [162]:
dtClusters.to_csv('analise_clusters_RFM.csv')
In [163]:
dtClustersIndexado = dtClusters.reset_index()
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
dtClustersIndexado
Out[163]:
Cluster Recency Frequency Monetary RFM
0 Cluster 0 395.67 1.00 296.31 1.64
1 Cluster 1 133.88 1.00 296.32 3.90
2 Cluster 2 221.00 2.23 682.74 3.27
3 Cluster 3 221.85 1.49 26987.42 3.20
In [164]:
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'

fig = go.Figure(data=[go.Table(
  header=dict(
    values=list(dtClustersIndexado.columns),
    line_color='darkslategray',
    fill_color=headerColor,
    align=['left','center'],
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=[dtClustersIndexado.Cluster, dtClustersIndexado.Recency, dtClustersIndexado.Frequency, dtClustersIndexado.Monetary\
           , dtClustersIndexado.RFM],
    line_color='darkslategray',
    
    fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
    align = ['left', 'center'],
    font = dict(color = 'darkslategray', size = 11)
    ))
])

fig.show()
In [ ]:
 
In [ ]: